Almost all of us have multiple cyberspace identities, and these cyberalter egos are networked
together to form a vast cyberspace social network. This network is distinct from the world-wide
web (WWW), which is being queried and mined to the tune of billions of dollars every day, and until
recently it has gone largely unexplored. Empirically, cyberspace social networks have been found
to possess many of the same complex features that characterize their real counterparts, including
scale-free degree distributions, low diameter, and extensive connectivity. We show that these
topological features make the latent networks particularly suitable for exploration and management
via local-only messaging protocols. Cyberalter egos can communicate via their direct links (i.e.,
using only their own address books) and set up a highly decentralized and scalable message passing
network that can allow large-scale sharing of information and data. As one particular example of
such collaborative systems, we provide a design of a spam filtering system, and our large-scale
simulations show that the system achieves a spam detection rate close to 100%, while the false
positive rate is kept around zero. This system of letting cyberalter egos network among themselves
has several advantages over other recent proposals for collaborative spam filtering: (i) It uses an
already existing network, created by the same social dynamics that govern our daily lives, and no
dedicated peer-to-peer (P2P) systems or centralized server-based systems need be constructed;
(ii) It utilizes a percolation search algorithm (which can be viewed as mimicking how a rumor
spreads in a social network) that makes the query-generated traffic scalable; (iii) The network
has a built-in trust system (just as in social networks) that can be used to thwart malicious
attacks; and (iv) It can be implemented right now as a plugin to popular email programs, such as
MS Outlook, Eudora, and Sendmail.



I. INTRODUCTION 



A. CyberAlter Ego and the Pervasive Cyberspace 
Social Networks 

Our socioeconomic activities are becoming intricately
entwined with our identity in cyberspace, and perhaps
we are witnessing the emergence of an alter ego in
cyberspace. For example, every email user can construct
a list of the email addresses from which he has received
emails or to which he has sent emails; this constitutes
one's cyber-neighborhood. This list is stored in the
address books or contact lists managed by one's email
client software or by the ISPs that one uses. It can also
be constructed automatically by sifting through one's
mailbox. Individuals on such lists have their own address
books and contact links, and so there is a cyberspace
network in which our identities, or cyberalter egos, are
firmly embedded and occupy various positions of power,
centrality, or proximity to cyber-communities of potential
interest. Thus, an undirected social email network can be
defined as follows: the nodes in the network correspond to
email addresses, and a pair of nodes is connected by an
edge if a message has been exchanged between the two
nodes. Similarly, a directed social email network can be
defined as follows: nodes again correspond to email
addresses, and a directed edge points from A to B if node
A has sent an email to node B, and vice versa [21]. One
can modify this network to incorporate other parameters
of interest; for example, each edge can be assigned a
weight based on the number of email messages exchanged,
or time-stamps can be added to the messages along each
edge so that one can prune the network to reflect the
recent status of interactions among the cyberalter egos.
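
The definitions above can be sketched in a few lines of code. The following is only an illustrative sketch: the input format (a list of sender/recipient pairs) and the function name `build_email_networks` are assumptions made for this example and are not part of the system described in the paper.

```python
from collections import defaultdict


def build_email_networks(messages):
    """Build weighted directed and undirected email networks.

    `messages` is an iterable of (sender, recipient) address pairs
    (a hypothetical log format chosen for this sketch).
    """
    directed = defaultdict(int)    # (sender, recipient) -> number of emails sent
    undirected = defaultdict(set)  # address -> set of neighbouring addresses
    for sender, recipient in messages:
        if sender == recipient:
            continue  # ignore self-addressed mail
        directed[(sender, recipient)] += 1
        undirected[sender].add(recipient)
        undirected[recipient].add(sender)
    return dict(directed), dict(undirected)


log = [("a@x.org", "b@x.org"), ("a@x.org", "b@x.org"), ("b@x.org", "c@x.org")]
weights, neighbours = build_email_networks(log)
```

Here `weights` holds the edge weights of the directed network (message counts), while `neighbours` is the adjacency structure of the undirected network.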

A major obstacle to studying such email networks has
been that the contact addresses and lists of a large enough
group of cyberalter egos are not available in the public
domain. Even though large ISPs, such as Hotmail, Yahoo,
and AOL, have this information for all their users,
it is not for public consumption. Drawn by the commercial
potential of these latent networks, a number of
companies [6] have started providing services where
participants can upload their address books, allowing the
corporation to create a central server where the social
email network is stored and updated; the goal is to provide
services to the participating clients by mining the
network. These networks, however, are also proprietary,
for both privacy and commercial secrecy reasons.
Fortunately for us, (i) the system that we have designed
does not make use of knowledge of the complete network;
to carry out the protocols described here, the cyberalter
egos only have to exchange messages with those
on their own contact lists and do not have to know
their neighbors' contact lists, and (ii) a few examples of
social email networks have been thoroughly studied in the
literature, allowing us to observe that they share many of
the same complex features as real-world social networks. In
particular, we will use the network analyzed by Ebel et
al. in a recent work [11], which shows that the network
has a scale-free structure, short diameter, and a giant
connected component (GCC) that contains more than 95%
of the nodes.

Since our cyberalter egos are becoming more entrenched
as a significant part of our overall social and
commercial selves, can one start managing and utilizing
their network the same way that we manage our real-life
social networks? Any such effort should abide by certain
rules, such as the need to protect the privacy of the users
and the need to allow participants to decide dynamically
whether they want to participate or not. The primary
contribution of this paper is to provide a decentralized,
efficient, and scalable system for querying and sharing
information on the global social networks. One major
application of this overlay information management system
is to filter spam, as reviewed in the following.



B. Spam and Content-based Spam Filtering

Spam, or Unsolicited Bulk Email, is plaguing internet
users around the world. It has been estimated that
approximately 68% of the worldwide email traffic today is
spam and up to 87% of the emails directed to US users
are spam [3].

For the past few years, numerous spam filters have
been proposed and deployed, and of all the existing
anti-spam solutions, two classes of spam filters have
emerged as the most effective and widely deployed:
Bayesian/rule-based spam filters and collaborative spam
filters. A Bayesian filter uses the entire context of an
email in looking for words or phrases that will identify
the email as spam, based on the experience gained
from the user's sets of legitimate emails and spams [12].
One example of a widely deployed Bayesian spam filter
is SpamAssassin [4]. Although the Bayesian anti-spam
solutions offer very impressive performance, they suffer
from several serious drawbacks: first, Bayesian filters
require an initial training period and their performance
degrades on messages composed of previously unknown
words; second, Bayesian filters are unable to block
messages that do not look like typical spam, such as
messages that consist of only a URL or messages that are
padded with random words. Most recently, a number of
multifaceted approaches have been proposed [7, 17]. They
consider combining various forms of filtering with
infrastructure changes, financial changes, legal recourse,
and more, to address the shortcomings of regular
statistical filters.



C. Collaborative Spam Filtering: Prior Work and 
Challenges 

The increasing realization that the dynamics of spam
constitute a complex phenomenon brewed, fostered,
and propagated in the interconnected realm of
cyberspace has prompted the use of collaborative spam
filters, where the basic idea is to use the collective
memory of, and feedback from, the users to reliably
identify spams. That is, for every new spam that is sent
out, some user must be the first one to identify it upon
receiving it (e.g., by using a Bayesian filter or
locally generated white and black lists); any subsequent
user that receives a suspect email can then query the
community of email users to find out whether it has
already been tagged as spam. In contrast to Bayesian-type
filters, collaborative spam filters do not suffer from the
drawbacks mentioned above, and it has been shown that
they are also capable of superior spam detection
performance [22]. The existing collaborative filtering
schemes mostly ignore the already present and pervasive
social communities in cyberspace and try to create new
communities of their own to facilitate the sharing of
information. This unenviable task of creating new social
communities is beset with several difficulties that have
limited the deployment and effective use of most
collaborative filtering schemes proposed so far. The
challenges include:

(i) How to find users to participate?: In order for a 
collaborative spam filter to be highly effective, a large 
number of users (on the order of hundreds of thousands 
or millions) must be participating in using the system. 
However, effectively finding and interconnecting a large 
number of willing participants is non-trivial. In other 
words making any artificially established community ac- 
ceptable and popular is an unpredictable and difficult 
task at best, and impossible at worst. 

(ii) How to make the search scalable?: The power of a
collaborative spam filter lies in the fact that spam data
resources from a large number of users are pooled together
and utilized to fight spam. In order to avoid high
server cost, the spam databases are typically stored
locally on users' computers. Finding a way to do efficient
searches on a network of distributed databases is very
challenging.

(iii) Whom to trust?: Inevitably, there will be malicious
users who try to subvert the collaborative anti-spam
system by providing false information regarding spam.
Therefore, a trust scheme must be devised that places more
weight on the opinions of provably trustworthy
users than on those of unknown users who are potentially
malicious.

The different proposed schemes for collaborative filtering
attempt to address the above challenges with different
degrees of effectiveness. For example, SpamNet [5] employs
the following mechanisms to address the challenges
stated above: it uses a central server model to connect
all the willing participants of this collaborative spam
filter. The central server solution does not scale as the
system grows, and the server becomes a single point of
attack or failure. In addition, SpamNet employs a
complicated algorithm to compute the trust score for each
of its users. SpamWatch [20] is a totally distributed spam
filter based on the Distributed Hash Table (DHT) system
Tapestry [19]. SpamWatch addresses the three challenges
of a collaborative spam filter in the following ways:
First, SpamWatch uses a DHT-based P2P system to connect
all the participants. The primary drawback in using
a DHT for collaborative spam filtering is that
DHTs do not provide a natural platform to network
existing databases, such as every email user's personal
database of spams. Merging and mining existing sets
of databases is very difficult, if not impossible. Second,
SpamWatch uses a hash-based mechanism called Approximate
Text Addressing (ATA) to perform general query
searches for spams. However, as seen in the description
of the ATA algorithm, supporting general query search
in a DHT is very complex and involves expensive
operations. DHTs typically excel at exact-match lookups
but do not perform well for applications that need to
support general search queries, such as a collaborative
spam filter. Lastly, SpamWatch does not offer any
mechanism to address the trust issue. Most recently, Gray
et al. have proposed CASSANDRA, a collaborative spam
filter where the network is formed as clusters of trusted
and similar peers. Finally, a new reputation analysis has
been proposed by Golbeck et al. [13], where reputation
relationships are inferred from the network structure and
are used as a method to score emails.



D. Harnessing The Global Social Email Network 

Recently, Boykin and Roychowdhury investigated the
notion of utilizing social networks to do spam filtering [8].
In their work, it was shown that just by looking at the
clustering coefficient of an email user's personal contact
network, their algorithm is able to achieve a spam
detection rate of 53% with zero false positives. Although
this algorithm is very attractive, it ignores the larger
social email network and focuses only on a projection of
it, as witnessed by an individual user, and it raises the
question of whether the larger social email networks can
be harnessed.

In this paper, we show that a high-performance, scalable,
and secure information management and query system
can be overlaid on the social email networks, and we
provide a case study for collaborative spam filtering. The
basic idea is the same as that of other proposed
collaborative spam filters; however, instead of using a
specialized network, we use the latent social email
network, over which the queries and messages are
exchanged. We show how the three challenges outlined in
the preceding discussion can be effectively addressed
using the topological properties of the underlying social
email networks and recent advances in complex networks
theory. First, no specially designed network has to be
created for collaborative filtering. In fact, one of the
main features of this system is that all queries and
communications are exchanged via email through personal
contacts, and that no server or traditional P2P system
with TCP/IP connections is needed. Second, we observe
that social email networks correspond to Power-Law (PL)
graphs [23] [11], with a PL coefficient around 2. Hence,
the underlying network naturally possesses a scale-free
structure that is a key hallmark of many unstructured P2P
systems that have organically grown for file-sharing on
the Internet. One can then utilize a scalable global search
system, namely the percolation search algorithm recently
proposed by Sarshar et al. [18], on this naturally
scale-free graph of social contacts to enable peers to
exchange their spam signature data. Third, one can harvest
and utilize the trust that is embedded in the web of email
contacts. By regarding contact links as local measures of
trust and using a distributed Singular-Value-Decomposition
(SVD) algorithm, we can obtain a trust score called
MailTrust. In fact, the famous Google PageRank [9] is
computed in a similar fashion. Finally, the proposed
system can be implemented right now as a plugin to popular
email programs, such as MS Outlook.

We show via extensive simulations that the system is
also capable of delivering high performance while
incurring minimal costs. Under the assumption that there
is a large number of users (on the order of hundreds of
thousands or millions), the system can offer a
spam detection rate around 99%; in fact, the detection
rate can reach close to 100% when the number of users
approaches the internet scale. At the same time, the
number of false positives in our system can be tightly
controlled to a level very close to zero. Meanwhile, as the
number of users of the system scales, the communication
cost of the system grows only sublinearly and
the memory storage cost grows only logarithmically.
In addition, because no TCP/IP connection is required and
all communications in the system are done via background
email exchanges, less computational and networking
burden is placed on local computers. Lastly, the system is
designed to be secure and rigorously protective of users'
privacy and confidentiality.

The rest of the paper is organized as follows. In Section
II, we present the background theory and the important
concepts vital to this paper, such as email network
theory and the percolation search algorithm. In Section
III, we describe the protocol of our social network based
collaborative anti-spam system in detail. In Section IV,
we use a real-world email network to perform large-scale
simulations of the system. In Section V, we construct a
threat model and show by simulation that a social network
based trust scheme is effective in minimizing the damage
caused by malicious users. Finally, in Section VI, we
address several important topics such as the protection of
privacy and the system's resilience against random user
failure.



II. BACKGROUND CONCEPTS 

Our system is motivated by a number of recent advances
in complex networks theory and systems, Eigen-methods
based computation of trust and relevance, and
the proven efficacy of the spam digest system as
signatures of emails. We briefly review this background
material in this section.



A. Topology of Social Email Networks 

A particular email network comprising 56,969 nodes
(i.e., email addresses) has been studied by Ebel et al. [11].
Based on the statistics reported in Ebel's work, we
identify three desirable properties that make social
email networks an attractive platform for building a
collaborative spam filter:

(i) An email network has been found to possess a
scale-free topology. More precisely, for the email network
examined in [11], the node degree distribution follows a
power law (PL): $P(k) \propto k^{-1.81}$, where $k$ is the
node degree and $P(k)$ denotes the probability that a
randomly chosen node has degree equal to $k$. One of the
consequences of this property is a very low percolation
threshold [18]; in other words, the network is extremely
resilient to random deletions of nodes. One can also show
that even if high-degree nodes are deleted preferentially,
one has to remove almost all the high-degree nodes
before the network becomes fragmented.

(ii) A large fraction of the nodes (95.2%) in a social email 
network is connected to the giant connected component 
(GCC). This means that any node can reach almost any 
other arbitrary node by simply following email links. 

(iii) The email network has a low diameter (i.e., there
exist short paths between almost any pair of nodes in
the network). In fact, for the email network investigated
by Ebel et al. [11], the mean shortest path length in the
giant connected component was found to be $\ell = 4.95$ for
a component size of 56,969 nodes. This short-diameter
property allows any email user to communicate efficiently
with any other email user in the network by crossing only
a few email contact links.
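
As a rough illustration of properties (ii) and (iii), the sketch below computes the giant connected component and the mean shortest path length of a small undirected graph using breadth-first search. The adjacency-dictionary representation and the function names are assumptions made for this example; exact all-pairs computation as done here is only practical for small graphs, and a network of the size studied in [11] would more plausibly be handled by sampling source nodes.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for node in adj:
        if node in seen:
            continue
        component = set(bfs_distances(adj, node))
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def mean_shortest_path(adj, component):
    """Mean hop distance over all ordered pairs inside `component`."""
    total, pairs = 0, 0
    for s in component:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs


# Toy graph: a path 1-2-3, a separate pair 4-5, and an isolated node 6.
toy = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
gcc = giant_component(toy)
gcc_fraction = len(gcc) / len(toy)
mean_path = mean_shortest_path(toy, gcc)
```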

The above properties of the social email network
should not come as a surprise, since they reflect the same
social dynamics that we practice in our everyday life.

FIG. 1: Percolation Search on Social Email Networks:
(a) The hit rate, fraction of links, and fraction of nodes
traversed as a function of the percolation probability.
Notice that there is a sudden jump in the hit rate above
the percolation threshold, while the fraction of links and
nodes processing the search query increases only linearly
after the threshold. The network used in this percolation
search simulation is a real-world email contact network.
The number of nodes is 56,969, $\tau \approx 1.81$, the TTL
is 50 for both query and content implants, and only one
unique content exists in the network.
(b) Hit rate for percolation search on the email contact
network with a TTL of 50. Repeating the percolation trial
multiple times pushes the hit rate exponentially close to 1.



B. Percolation Search and Scalability 

We can utilize the percolation search algorithm proposed
by Sarshar et al. [18], which exploits the presence of
a tightly connected core comprising mostly high-degree
nodes. In particular, it is shown in [18] that unstructured
searches in PL networks can be made highly scalable using
the percolation search algorithm. The algorithm involves
message passing on direct links only, and in some sense it
resembles how rumors propagate in social networks. The
key steps of the algorithm are as follows:

(i) Caching or Content Implantation: Each node performs
a short random walk in the network and caches its
content list on each of the visited nodes. The length of
this short random walk is specified later and is referred
to as the Time To Live (TTL).

(ii) Query Implantation: When a node intends to make
a query, it first executes a short random walk and
implants its query request on the nodes visited. The
length of this random walk is usually taken to be the
same as the TTL used in the content implantation step (i).

(iii) Bond Percolation: All the implanted query requests
are propagated through the network in a probabilistic
manner; upon receiving the query, a node relays it
to each of its neighboring nodes with percolation
probability $p$, which is a constant multiple of the
percolation threshold, $p_c$, of the underlying network.
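
The three steps above can be sketched as follows. This is a minimal single-process simulation written for illustration, not the implementation of [18]: the adjacency-dictionary graph, the shared `cache` dictionary, and all function names are assumptions of this example, and real nodes would exchange messages rather than share memory.

```python
import random


def random_walk(adj, start, ttl, rng):
    """Nodes visited by a random walk of at most `ttl` steps (start included)."""
    visited, node = [start], start
    for _ in range(ttl):
        if not adj[node]:
            break  # dead end: the walk stops early
        node = rng.choice(sorted(adj[node]))
        visited.append(node)
    return visited


def implant_content(adj, owner, item, ttl, cache, rng):
    """Step (i): cache `item` on every node of a short random walk."""
    for node in random_walk(adj, owner, ttl, rng):
        cache.setdefault(node, set()).add(item)


def percolation_query(adj, origin, item, ttl, p, cache, rng):
    """Steps (ii) and (iii): implant the query, then percolate it.

    Returns the set of nodes that hold `item` and were reached by the query.
    """
    reached = set(random_walk(adj, origin, ttl, rng))  # query implants
    frontier = list(reached)
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            # Each edge relays the query independently with probability p.
            if v not in reached and rng.random() < p:
                reached.add(v)
                frontier.append(v)
    return {n for n in reached if item in cache.get(n, ())}


# Toy example on a ring of six nodes.
rng = random.Random(0)
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
cache = {}
implant_content(ring, 0, "digest-x", 2, cache, rng)
hits_all = percolation_query(ring, 3, "digest-x", 2, 1.0, cache, rng)
hits_none = percolation_query(ring, 3, "digest-x", 2, 0.0, cache, rng)
```

With `p = 1.0` the query floods the connected ring and finds every cached copy; with `p = 0.0` only the nodes holding a query implant are checked.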

It is shown in [18] that the percolation threshold of
any random network is given by $p_c = \langle k \rangle / \langle k^2 \rangle$.
For a PL network with exponent $\tau$ and maximum degree
$k_{max}$, we have $\langle k^2 \rangle = O(k_{max}^{3-\tau})$ and
$\langle k \rangle = O(k_{max}^{2-\tau})$, and hence we get a
percolation threshold of $p_c = O(k_{max}^{-1})$, which is
vanishingly small if $k_{max}$ increases with the size of the
network, which is usually the case. Thus, if we percolate
at a multiple $\gamma$ of $p_c$, then the total traffic
generated would be
$C_\tau = \gamma p_c \langle k \rangle N = O(\langle k \rangle^2 N / \langle k^2 \rangle) = O(k_{max}^{1-\tau} N)$.
In real-world networks, $k_{max}$ typically scales sublinearly
as a function of the network size. For
$k_{max} = O(N^{1/\tau})$, we have
$C_\tau = O(k_{max}^{1-\tau} N) = O(N^{1/\tau})$. For a detailed
analysis of the hit rate and how it behaves as one performs
multiple searches, see [18].
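
For a concrete network, the threshold $p_c = \langle k \rangle / \langle k^2 \rangle$ and the traffic estimate $\gamma p_c \langle k \rangle N$ can be evaluated directly from the degree sequence. The helper names below are hypothetical and the sketch only evaluates these two formulas; it makes no claim about how the simulations in this paper were coded.

```python
def percolation_threshold(degrees):
    """p_c = <k> / <k^2> for the given degree sequence."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return mean_k / mean_k2


def query_traffic_estimate(degrees, gamma):
    """Traffic estimate gamma * p_c * <k> * N for percolation at gamma * p_c."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    return gamma * percolation_threshold(degrees) * mean_k * n


# Toy degree sequence of a star graph: one hub of degree 4 and four leaves.
star = [1, 1, 1, 1, 4]
p_c = percolation_threshold(star)
traffic = query_traffic_estimate(star, gamma=2)
```

For the star, $\langle k \rangle = 1.6$ and $\langle k^2 \rangle = 4$, so the threshold is 0.4; a heavier-tailed degree sequence increases $\langle k^2 \rangle$ and lowers the threshold accordingly.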

Since social email networks have a PL degree
distribution, they are ideally suited for reaping the
benefits of a percolation search; the simulation plots
obtained from performing percolation search on the real
email dataset [1, 11] are provided in Fig. 1.

C. The MailTrust Algorithm 

Just as in the case of the WWW, where PageRank captures
the relevance of a particular web page, the topological
structure of the social email networks can be used to
assign trust or reputation to individual users. First, we
model each email contact as placing a unit of trust on
the recipient. Thus, for a node that contacts $k_{out}$ other
nodes, we can compute the fraction of trust that this
node places on each of its out-neighbors as follows: the
trust for neighbor $i$, $t_i$, is equal to the number of
emails sent to neighbor $i$ divided by the total number of
emails sent. Note that the collection of $t_i$'s forms a
probability vector, called the personal (or local) trust
vector $\vec{t}$. Thus, if we model the entire email network
as a discrete-time Markov chain, the local trust vector,
$\vec{t}$, becomes the transition probability function for
each node. We then compute the steady-state probability
vector using the Power Iteration method, which is the same
algorithm adopted to compute the PageRank score of
documents on the web [9, 14]. As discussed in the
literature, one needs to make sure that this Markov chain
is ergodic, and this can be achieved by having nodes with
zero out-degree assign uniform trust to a set of carefully
picked pre-trusted nodes. An alternate way to compute the
MailTrust score in a distributed fashion can be found in
[14], along with a scheme for keeping the trust scores
secure in the system even in the presence of malicious
users.
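
The centralized version of this computation can be sketched as a plain power iteration. The function below is an illustrative sketch under stated assumptions: the nested-dictionary input format, the function name `mailtrust_scores`, and the fixed iteration count are choices made for this example, and the sketch assumes the resulting chain is ergodic (it does not add any damping, so a periodic chain would not converge).

```python
def mailtrust_scores(sent_counts, pretrusted, iterations=100):
    """Power iteration on the email Markov chain.

    sent_counts[u][v] is the number of emails u sent to v (assumed format);
    `pretrusted` is a non-empty list of nodes that receive the trust of
    nodes with zero out-degree.
    """
    nodes = set(sent_counts)
    for targets in sent_counts.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Local trust vectors: email counts normalised per sender.
    local = {}
    for u in nodes:
        out = sent_counts.get(u, {})
        total = sum(out.values())
        if total > 0:
            local[u] = {v: c / total for v, c in out.items()}
        else:
            # Zero out-degree: uniform trust on the pre-trusted nodes.
            local[u] = {v: 1 / len(pretrusted) for v in pretrusted}

    trust = {u: 1 / len(nodes) for u in nodes}  # uniform starting vector
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        for u, row in local.items():
            for v, w in row.items():
                nxt[v] += trust[u] * w
        trust = nxt
    return trust


counts = {"a": {"b": 3, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1}}
scores = mailtrust_scores(counts, pretrusted=["a"])
```

In this toy network the stationary scores are 8/19, 6/19, and 5/19 for a, b, and c respectively, so a, which receives mail from both other nodes, ends up with the highest score.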



FIG. 2: MailTrust: A simple illustration of the MailTrust
algorithm. The numbers in parentheses represent the local
trust values that each node places on his/her neighbors. The
MailTrust score for each node is then obtained by computing
the steady-state probability vector of the Markov chain.

We will refer to this trust score as MailTrust in the rest 
of this paper. A plot of the MailTrust scores obtained 
from [1, 11] is shown in Fig. 3. 

D. Digest-based Spam Indexing 

In a collaborative spam filtering system, it is important
to have an effective mechanism to index known spams so
that subsequent arrivals of the same spam can be correctly
identified. The collaborative design of the system
does not depend on any specific algorithm, but for initial
experimental results we have adopted the well-known
digest-based indexing mechanism [10] to share spam
information between users. Damiani et al. have rigorously
demonstrated that the digest algorithm described in [10]
is highly resilient against the possible forms of automatic
modification of spam emails. The digest algorithm is
further shown to be privacy preserving and to produce
close to zero false positives (i.e., cases where the digest
of one email matches the digest of an unrelated email).

Algorithm 1 PROCESS-MAIL(Email E)

if DefinitelySpam(E) then
    Mark E as spam
else if DefinitelyNotSpam(E) then
    Mark E as not spam
else
    {gray spam}
    D_E = Digest(E)
    Implant percolation of D_E on a random walk of length l
    Wait(T)
    H_E = HitScore()
    if H_E < threshold then
        Mark E as not spam
    else
        Mark E as spam
    end if
end if



FIG. 3: MailTrust Distribution: The probability density 
function of MailTrust scores using 10** bins. These scores are
obtained by applying the MailTrust algorithm on the email 
network data set from [11]. Notice that this probability den- 
sity function is heavy-tailed, indicating that a few nodes are 
much more trustworthy than most nodes. 



III. IMPLEMENTATION AND SYSTEM 
PROTOCOL 

In order to use our proposed collaborative spam filtering
system, an interested individual must first obtain a
simple client program that works as a plug-in to an email
program such as MS Outlook, Eudora, Sendmail, etc. [24].
This simple client only needs to provide the following
features: first, the client must come with a
digest-generating function as specified in Section II D;
second, the client is responsible for keeping a personal
blacklist of spams for the end-user as well as caching
blacklists of spams for other nodes, as described in the
section on the percolation search algorithm (see Section
II B); third, the client must have access to the list of
social email contacts (both inbound and outbound) of the
end-user. The pseudocode of the distributed client is given
in Algorithm 1.

Message Arrivals and Digest Indexing: When
an email message arrives at the end-user, the
method checks whether it is definitely spam or definitely
not spam (DefinitelySpam/DefinitelyNotSpam). Any
traditional spam filtering method, such as a white list, a
blacklist, or a Bayesian filter, can be integrated to
create a hybrid multi-tier architecture. DefinitelyNotSpam,
for example, can be a white list of the addresses in the
contact list, and DefinitelySpam can be the output of a
Bayesian filter when the filter indicates that the email is
spam with very high probability. If neither test is
conclusive, the email is suspected to be spam, and the
client program calls the digest function to generate
a digest, D_E, for the message.

Making a Query in the System: Now, we
query the system to find out whether any other user in
the network already has the digest, D_E, on its spam list.



Algorithm 2 PUBLISH-SPAM(Email E)

D_E = Digest(E)
Implant D_E on a random walk of length l



Each query message for this digest is then implanted on
a random walk of length l. Nodes with an implanted
query request will then percolate the query message
containing the digest, D_E, through their email contact
network using a probabilistic broadcast scheme, as
specified in the bond percolation step of the percolation
search algorithm. Each node visited by the query declares a
hit if the digest, D_E, matches any of the digests that
are cached on that node [25]. All the hits are routed
back to the node that originated the query through the
same path by which the query message arrived at the hit
node. If the nodes have trust scores, then the returned
hits include their trust scores as well.

Processing the Hits and Making the Decision:
After all the hits are routed back, HitScore is calculated
as the number of positive hits (or as their trust-weighted
sum if the trust scheme is used; see Section II C). If it
exceeds a constant threshold value, the message in question
is declared as spam; otherwise, the email message is
determined to be non-spam.
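
The scoring and decision step can be sketched as below. The function names and the dictionary of trust scores are assumptions made for this example; following the comparison used in Algorithm 1, a score that reaches the threshold is treated as spam.

```python
def hit_score(hits, mailtrust=None):
    """Score the returned hits.

    `hits` is an iterable of node identifiers that reported a digest match.
    Without trust scores the score is the number of distinct hit nodes;
    with a `mailtrust` dictionary it is the sum of their trust scores.
    """
    distinct = set(hits)
    if mailtrust is None:
        return len(distinct)
    # Nodes without a known trust score contribute nothing (an assumption).
    return sum(mailtrust.get(h, 0.0) for h in distinct)


def classify(hits, threshold, mailtrust=None):
    """Return "spam" if the hit score reaches the threshold."""
    return "spam" if hit_score(hits, mailtrust) >= threshold else "not spam"


plain = classify(["n1", "n2", "n2"], threshold=2)
weighted = classify(["n1", "n2"], threshold=0.5,
                    mailtrust={"n1": 0.3, "n2": 0.1})
```

Counting distinct nodes prevents a single node that answers several times from inflating the score, and the weighted variant lets a few highly trusted nodes outweigh many unknown ones.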

Publishing Digest: If an email is declared as spam
and placed in the user's "spam" folder, then the
Publish-Spam function is called; it generates the digest
of the spam message, D_E, and caches the digest on a short
random walk, as specified in the caching or content
implantation step of the percolation search algorithm.

System Maintenance: If the EigenTrust algorithm 



Algorithm 3 HitScore(Hits)

if Using MailTrust then
    HitScore = \sum_{h \in Hits} mailtrust(h)
else
    HitScore = |Hits|
end if
Return HitScore

FIG. 4: An illustration of the protocol of the system.
I. Content Implantation: S1 and S2 implant their blacklists
of known spams through random walks (TTL = 2).
II. Query Implantation: A receives a suspected message and
initiates a query implantation through a random walk
(TTL = 2).
III. Bond Percolation: Q1 and Q2 initiate the bond
percolation process; the query messages find hits at C1
and C4.
IV. Hit Route-Back: The hits are routed back to node A
through the same paths; node A receives two hits.



from Section II C is implemented, we would need to update
the trust scores of the nodes on a periodic basis.
Since most people's sets of email contacts change
no faster than daily, the distributed EigenTrust
computation needs to be performed at most once a day to
obtain up-to-date trust scores for all nodes. Connectivity
of the network is maintained by simple background
messages declaring join/leave, sent to each of the user's
contacts.



IV. SIMULATION AND SYSTEM 
PERFORMANCE 

Network Model: In this section, all simulations are
performed on the real-world email network investigated by
Ebel et al. [11]. (The email network data can be
obtained from [1].) In the following simulations,
only the giant connected component is used, which
contains 95.2% of all nodes in the original dataset. See
Table I for the specific values of this email network's
parameters.

Spam Arrival Model: We model the spam detection 
performance of a collaborative spam filter as a function 
of the number of copies of the similar spam messages 
that arrive to the system. In the extreme case that every 
spam arrived to the system is unique, one can easily see 
that a collaborative filter would be totally futile, since no 
user can benefit from the prior identifications of others. 



We assume that similar spam messages are sent to approximately 5 million internet users on average [26] and estimate the number of internet users worldwide at 600 million [2]. Assuming further that spammers select spam targets uniformly at random from the set of all internet users, the probability that any individual receives a copy of a given spam is approximately 0.8%. Since there are 56,969 nodes/users in our email network, the number of identical spams arriving at this network is about 500. We further assume that each spam message arrives at nodes of the network uniformly at random. [27]
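The arithmetic behind the spam arrival model can be checked with a few lines of Python. This is only a sketch of the estimate described above; the variable names are ours, and the inputs are the figures quoted in the text.

```python
# Inputs quoted in the text (the variable names are illustrative).
USERS_TARGETED_PER_SPAM = 5_000_000   # average recipients of similar spam [26]
INTERNET_USERS = 600_000_000          # estimated worldwide internet users [2]
NETWORK_NODES = 56_969                # nodes in the email network used here

# Probability that a uniformly chosen internet user receives a given spam.
p_receive = USERS_TARGETED_PER_SPAM / INTERNET_USERS

# Expected number of copies of one spam that land inside the email network.
expected_copies = p_receive * NETWORK_NODES

print(f"p_receive       = {p_receive:.4f}")
print(f"expected copies = {expected_copies:.0f}")
```

The expected count comes out slightly below 500, which the simulation rounds up to 500 arrivals.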

Specification of Percolation Probabilities: Recall that in the percolation search algorithm, each edge carries a message with probability p, which is chosen to be a constant multiple of the percolation threshold of the network. In general, the percolation threshold might not be known, so one needs a scheme that adaptively performs the search using an increasing sequence of percolation probabilities. To ensure a high hit rate for queries and a low communication cost for the system, we propose the following scheme: we start the first query with a very low percolation probability; if not enough hits are returned, we send out a second query with twice the percolation probability of the first; if still not enough hits are routed back, we keep doubling the percolation probability until it reaches a maximum value, p_max; once this maximum is reached, we repeat the query at the maximum percolation probability for a constant number of trials, n_reps, and stop. The query search is terminated as soon as the total number of distinct hits routed back reaches the threshold after any given trial. If no hits are returned after n_reps attempts at the maximum probability p_max, the search is terminated and the queried item is considered absent.
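The doubling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the `run_trial` callback (which stands in for one percolation search at probability p and returns the set of nodes that reported a hit) are hypothetical, and we assume that the first trial at p_max counts as one of the n_reps repetitions.

```python
from typing import Callable, Iterator, Set


def percolation_schedule(p_start: float, p_max: float, n_reps: int) -> Iterator[float]:
    """Yield the probabilities to try: double until p_max, then repeat p_max."""
    p = p_start
    while p < p_max:
        yield p
        p = min(2 * p, p_max)  # never overshoot the maximum
    for _ in range(n_reps):
        yield p_max


def adaptive_search(
    run_trial: Callable[[float], Set[str]],  # hypothetical: one search trial
    p_start: float = 0.00625,
    p_max: float = 0.05,
    n_reps: int = 3,
    hit_threshold: int = 2,
) -> Set[str]:
    """Accumulate distinct hits over trials; stop once the threshold is met."""
    hits: Set[str] = set()
    for p in percolation_schedule(p_start, p_max, n_reps):
        hits |= run_trial(p)
        if len(hits) >= hit_threshold:
            break
    return hits  # fewer than hit_threshold hits means the search failed


if __name__ == "__main__":
    print(list(percolation_schedule(0.00625, 0.05, 3)))
```

Because hits are accumulated in a set, a node that answers in two different trials is counted once, which matches the "distinct hits" wording of the stopping rule.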

For the simulation experiment in this section, we set the starting percolation probability to .00625, p_max to .05, and n_reps = 3. All other relevant parameters of the experiment are specified in Table I. In addition, we assume that upon receipt of a new spam message, every node immediately caches this new content on a random walk as specified in the percolation search algorithm. This is done regardless of whether the spam was automatically filtered by the system or leaked through the filter and had to be identified by human inspection or some other means.

Simulation Execution: The simulation is repeated for 30 runs. In each run, 500 copies of the same spam arrive sequentially at different nodes in the network. The nodes are selected uniformly at random for each spam arrival. The first node receiving the spam performs a search but of course gets no hits; similarly, the second node gets at most one hit, which is below the threshold of 2 needed to identify the message as spam. For these first two nodes, after the searches fall short of the threshold, the messages are manually tagged as spam. Since these two initial searches are counted as misses, the maximum detection rate is 498/500 = 99.6%, where the detection rate is the number of successful spam detections divided by the total number of spam arrivals. We record the detection rate for each run, compute the overall average and standard deviation, and plot the results as error-bar plots. In addition to the detection rate, we also record the percentage of edges crossed per query, which is the primary metric for network traffic cost. We repeat the simulation while varying one parameter: n_reps, the number of query trials repeated with the percolation probability set at p_max before declaring failure.
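The detection-rate ceiling and the per-run summary statistics described above can be sketched as below. The helper names are ours, and the per-run rates in the example are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def max_detection_rate(n_arrivals: int, hit_threshold: int) -> float:
    """The first `hit_threshold` arrivals cannot collect enough hits,
    so they are unavoidable misses."""
    return (n_arrivals - hit_threshold) / n_arrivals


def summarize_runs(rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across runs (error-bar inputs)."""
    return mean(rates), stdev(rates)


ceiling = max_detection_rate(n_arrivals=500, hit_threshold=2)
print(f"maximum detection rate: {ceiling:.1%}")

# Hypothetical per-run detection rates, for illustration only.
hypothetical_rates = [0.994, 0.996, 0.992]
avg, sd = summarize_runs(hypothetical_rates)
print(f"mean = {avg:.4f}, std = {sd:.4f}")
```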

Simulation Results Analysis: Fig. 5 plots the simulated spam detection rate (in percent) as a function of n_reps, averaged over 30 runs. Note that for n_reps ≥ 3, the spam detection rate is extremely close to the maximum detection rate of 99.6% for this experiment.

Fig. 6 plots the percentage of edges crossed per query as a function of n_reps, averaged over 30 runs. (A query is defined as a series of percolation search trials, as described in the subsection above.) Note that the network traffic cost is extremely low: on average, only approximately 0.1% of the 84,190 email links in the network need to be crossed in order to get enough query hits to identify a suspected message as spam. Combining the results from Fig. 5 and Fig. 6, one can argue that n_reps = 3 is a good operating point, since it gives near-optimal spam detection performance while incurring minimal traffic cost.

Fig. 7 shows the network traffic as the average number of messages processed by nodes with degree k.

FIG. 5: Spam Detection Performance: This figure plots the simulated spam detection rate (in percent) as a function of n_reps, the number of query trials repeated with the percolation probability set at p_max before declaring failure. Note that all the average detection rates are well above 99%. The results are averaged over 30 runs, and the error bars show one standard deviation above and below the mean.

FIG. 6: Overall Traffic Per Query: This figure plots the percentage of edges crossed per query as a function of n_reps, averaged over 30 runs. Note that the traffic cost is extremely low (only around 0.1% of network links need to be crossed per query). The error bars show one standard deviation above and below the mean.

Fig. 8 shows the average number of participating nodes in a query as a function of node degree. As expected, high-degree nodes are more likely to be visited for any given query, since they are connected to a large number of nodes.

TABLE I: Simulation Settings

Network
  # of nodes                                        56,969
  # of edges                                        84,190
  Node degree distribution                          Power-Law (PL)
  PL exponent                                       ≈ 1.8
  mean node degree <k>                              2.96
  node degree 2nd moment <k^2>                      174.937
  approximate percolation threshold (q_c)           ≈ .0169
  time-to-live (ttl)                                50

Simulation Param.
  # of arrivals of the same spam                    500
  threshold (# of hits needed to identify spam)     2
  percolation probability trials                    [.00625 .0125 .025 .05 .05 ...]
  # of runs                                         30

Threat Model
  # of time steps                                   25
  # of malicious nodes inserted per time step       10
  total # of mailing lists                          50,000
  Zipf coefficient                                  0.8
  # of non-spams queried per time step (x)          1,000
  m, number of items on a blacklist                 10
  % of user's non-spam to be queried                5%

FIG. 7: Traffic vs. Degree: The data points show the average number of messages processed per percolation query by a node with degree k (i.e., the total number of messages processed per query by all nodes in the network with degree k, divided by the number of nodes with degree k). This plot is obtained using an n_reps value of 3 for every percolation query. The slope of the linear fit is 0.0019 messages per query per degree. Since each node forwards a query along a link with a fixed percolation probability, we naturally expect high-degree nodes to handle more messages.

Bandwidth Cost Estimates: Fig. 6 shows that the required traffic for each query is about 0.1% of edges, which corresponds to about 84 emails. Moreover, every short email containing the digest of a message is about 1 KByte in size, and every email incurs bandwidth cost on both the sender and the receiver. Thus, each query requires approximately (84+50) email exchanges (where the number 50 corresponds to the random-walk query implantation with TTL=50), which at 1 KByte/email, counted once at the sender and once at the receiver, results in a total of 268 KByte per query. This total traffic per query is distributed among all the nodes, and in particular falls more heavily on the high-degree nodes, as shown in Fig. 7.
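The per-query bandwidth figure can be reproduced with the short calculation below. It is a sketch under the assumptions stated in the text (about 0.1% of edges crossed, a 50-hop random walk, 1 KByte per email, each email counted at both sender and receiver); the constant names are ours.

```python
N_EDGES = 84_190            # email links in the network (Table I)
FRACTION_CROSSED = 0.001    # about 0.1% of edges crossed per query (Fig. 6)
TTL = 50                    # random-walk hops for query implantation
EMAIL_KBYTE = 1             # approximate size of one digest email
ENDPOINTS_PER_EMAIL = 2     # each email costs bandwidth at sender and receiver

percolation_emails = round(N_EDGES * FRACTION_CROSSED)   # about 84 emails
exchanges_per_query = percolation_emails + TTL
total_kbyte = exchanges_per_query * EMAIL_KBYTE * ENDPOINTS_PER_EMAIL

print(f"{exchanges_per_query} email exchanges, {total_kbyte} KByte per query")
```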

FIG. 8: The data points show the fraction of nodes with degree k visited per percolation query (i.e., the number of nodes with degree k visited per query divided by the total number of nodes in the network with degree k). This plot is obtained using an n_reps value of 3 for every percolation query. The slope of the linear fit is 1.573xl0~*/degree.

Let us consider the worst-case scenario first. In the network used for this simulation, a very high-degree node typically processes around 1.5 messages per query, as seen from Fig. 7; only one set of nodes exceeds this value. Assuming that every user gets 1 spam per hour, we conclude that a very high-degree node would need to process about 85,500 messages per hour, since there are around 57,000 nodes in the network. Since the query message size is 1 KByte, the bandwidth cost on high-degree nodes would be around 85 MByte per hour [28], which is equivalent to around 0.18 Mb/second. For a typical fast internet connection of 100 Mb/second, this represents about 0.2% of the available bandwidth.

For nodes with lower degree, the cost is substantially lower. For example, a node with degree 100 would process on average 0.19 messages per query, and hence, using the same estimate of 1 spam per node per hour, its bandwidth cost would be only around 23 Kb/second.
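The per-node estimates can be recomputed with a small helper. This is a sketch only: the function name and defaults are ours, and we assume 1 KByte = 1000 bytes and that every node issues one query per spam it receives.

```python
def node_bandwidth_bits_per_second(
    msgs_per_query: float,
    n_nodes: int = 57_000,                 # approximate network size
    spams_per_user_per_hour: float = 1.0,  # assumed spam arrival rate
    msg_bytes: int = 1_000,                # assumed 1 KByte = 1000 bytes
) -> float:
    """Average bandwidth a node spends handling other users' queries."""
    queries_per_hour = n_nodes * spams_per_user_per_hour
    bytes_per_hour = msgs_per_query * queries_per_hour * msg_bytes
    return bytes_per_hour * 8 / 3600


high_degree = node_bandwidth_bits_per_second(1.5)    # very high-degree node
degree_100 = node_bandwidth_bits_per_second(0.19)    # node with degree 100

print(f"high-degree node: {high_degree / 1e6:.2f} Mb/s")
print(f"degree-100 node:  {degree_100 / 1e3:.0f} Kb/s")
```

Under these assumptions the results are roughly 0.19 Mb/s and 24 Kb/s, close to the rounded figures quoted in the text; the small difference depends on rounding conventions.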



V. THREAT MODEL AND EFFECTIVENESS OF TRUST SCHEME

In this section, we construct a model of malicious users in the network trying their best to subvert the system. Through a large-scale simulation, we demonstrate that implementing the EigenTrust algorithm presented in section II C can effectively reduce the damage inflicted by the malicious users.

With the system introduced so far, a malicious node can subvert the system by introducing blacklists of well-known valid messages into the network. [29] As a result, messages from mailing lists become easy targets for an attacker. Note that this form of attack only raises the false positive rate of the system; it has no impact on the spam detection rate. Every malicious node picks a fixed set of mailing lists and periodically updates its blacklist with new messages from those mailing lists. In addition, we assume that the popularity of mailing lists follows a Zipf distribution and that the probability that a mailing list is queried follows the same Zipf law. We further assume that the spammer wants to inflict maximum damage and thus selects the mailing lists to blacklist following the same Zipf popularity distribution, since users of the system are more likely to be subscribed to popular mailing lists.
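One way to draw mailing lists according to a Zipf popularity law is sketched below. The function names are hypothetical, and we assume the common form in which the list of rank r has weight proportional to 1/r^s, with s the Zipf coefficient (0.8 in Table I).

```python
import random
from itertools import accumulate
from typing import List


def zipf_weights(n_items: int, s: float) -> List[float]:
    """Unnormalized Zipf weights: rank r (1 = most popular) gets 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n_items + 1)]


def sample_mailing_lists(n_lists: int, s: float, k: int, rng: random.Random) -> List[int]:
    """Draw k mailing-list ranks (with replacement) following the Zipf law."""
    cumulative = list(accumulate(zipf_weights(n_lists, s)))
    return rng.choices(range(1, n_lists + 1), cum_weights=cumulative, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    # Both the attacker's targets and the users' queries use the same law.
    print(sample_mailing_lists(n_lists=50_000, s=0.8, k=10, rng=rng))
```

Using the same sampler for blacklisted lists and for queried lists reflects the assumption above that attackers and legitimate queries both concentrate on popular lists.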



A. Simulation Setup and Trust Scheme 

The simulation setup and parameters in this section are identical to those of the simulation in section IV, except for the following. First, a small fraction of nodes in the network (250 nodes) are labelled as malicious, and these malicious nodes blacklist non-spams from popular mailing lists. Second, for simulation purposes, we assume that the probability that a node in the email network is malicious is inversely proportional to its in-degree, since low in-degree nodes are trusted by only a few peer email users and are thus more likely to be malicious. Third, the malicious nodes follow all specifications of the protocol, such as forwarding and routing queries, storing the cache implants for other nodes, etc. [30]. Fourth, we relax the uniform spam arrival assumption of section IV: in this simulation, the probability that a node receives a spam is directly proportional to its in-degree. The justification for this assumption is that a high in-degree signifies very active and long-time usage of the email account, which is thus more likely to receive spams. All relevant parameters are specified in Table I.
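The two degree-dependent assumptions can be sketched as weighted sampling. This is one possible reading, not the authors' code: the function names and the toy in-degree map are hypothetical, nodes with in-degree 0 are treated as in-degree 1, and malicious nodes are drawn without replacement using the Efraimidis-Spirakis key method.

```python
import random
from typing import Dict, List


def pick_malicious(in_degree: Dict[str, int], n_malicious: int, rng: random.Random) -> List[str]:
    """Choose distinct malicious nodes with weight 1 / in-degree.

    Each node draws a key u ** (1 / weight); the n largest keys are selected
    (Efraimidis-Spirakis weighted sampling without replacement).
    """
    keys = {}
    for node, degree in in_degree.items():
        weight = 1.0 / max(degree, 1)   # assumption: treat in-degree 0 as 1
        keys[node] = rng.random() ** (1.0 / weight)
    return sorted(keys, key=keys.get, reverse=True)[:n_malicious]


def pick_spam_recipients(in_degree: Dict[str, int], n_arrivals: int, rng: random.Random) -> List[str]:
    """Choose the recipient of each spam copy with weight equal to in-degree."""
    nodes = list(in_degree)
    weights = [in_degree[node] for node in nodes]
    return rng.choices(nodes, weights=weights, k=n_arrivals)


if __name__ == "__main__":
    toy_in_degree = {"a": 1, "b": 2, "c": 50, "d": 1}   # hypothetical network
    rng = random.Random(0)
    print(pick_malicious(toy_in_degree, 2, rng))
    print(pick_spam_recipients(toy_in_degree, 5, rng))
```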



We then perform a Monte Carlo simulation on the email network as follows. At every time step, ten malicious nodes insert their malicious content, which consists of blacklists of non-spams. Also, 500 copies of the same spam message arrive as in section IV. In addition to the spam arrivals, a constant number of non-spams arrive and are queried by users. Based on the hits that are routed back, nodes classify the queried messages as spam or non-spam.

We use two methods for spam classification: the non-trust scheme and the MailTrust scheme (for the specification of the MailTrust algorithm, see section II C). Under the non-trust scheme, a suspected message is classified as spam if the number of distinct hits routed back is greater than or equal to a threshold (the threshold is set at 2 to give performance comparable to that in section IV). Under the MailTrust scheme, the queried message is identified as spam if the sum of the MailTrust scores of the distinct hits routed back is above a threshold. This threshold is set to give a spam detection rate comparable to that of the non-trust scheme. The results are plotted in Fig. 9, which shows the spam detection rate and the false positive rate as functions of the number of malicious nodes inserted. As shown in the plot, the malicious nodes have no impact on the spam detection rate, since their blacklists of non-spams do not affect the ability of normal users to blacklist and identify spams. From the spam detection rate plot, one can see that both schemes give comparable spam detection rates. However, the false positive rate plot shows that the MailTrust scheme lowers the false positive rate by about 50%.

This improvement is mainly due to the fact that high in-degree nodes tend to have high trust scores and receive more spams. Thus, for the MailTrust scheme, we can set the threshold value somewhat high and still have a very good spam detection rate, because a large fraction of the query hits for spams will be provided by high in-degree nodes, which have high trust scores. In addition, most malicious nodes have low trust scores, since they tend to have low in-degrees (this assumption is made in the subsection above).
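The difference between the two classification rules can be illustrated with a toy example. The function names, trust scores, and trust threshold below are hypothetical values chosen for illustration; the paper does not report the threshold used for the MailTrust scheme.

```python
from typing import Dict, Iterable


def is_spam_non_trust(hit_nodes: Iterable[str], hit_threshold: int = 2) -> bool:
    """Non-trust scheme: count distinct hits."""
    return len(set(hit_nodes)) >= hit_threshold


def is_spam_trust(hit_nodes: Iterable[str], trust: Dict[str, float], trust_threshold: float) -> bool:
    """Trust scheme: sum the trust scores of the distinct hit nodes."""
    total = sum(trust.get(node, 0.0) for node in set(hit_nodes))
    return total > trust_threshold


# Hypothetical trust scores: well-connected nodes score high, attackers low.
trust_scores = {"hub1": 0.30, "hub2": 0.25, "mal1": 0.01, "mal2": 0.01}
TRUST_THRESHOLD = 0.20  # hypothetical value

real_spam_hits = ["hub1", "hub2"]      # spam blacklisted by trusted nodes
mailing_list_hits = ["mal1", "mal2"]   # valid message blacklisted by attackers

print(is_spam_non_trust(real_spam_hits), is_spam_trust(real_spam_hits, trust_scores, TRUST_THRESHOLD))
print(is_spam_non_trust(mailing_list_hits), is_spam_trust(mailing_list_hits, trust_scores, TRUST_THRESHOLD))
```

In this toy case both rules flag the real spam, while only the count-based rule is fooled by two low-trust attackers vouching against a valid message.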

FIG. 9: MailTrust Performance: The top and bottom figures plot the spam detection rate and the attack effectiveness rate as a function of the number of malicious nodes that have joined the system. While both schemes yield approximately the same detection rate, the MailTrust scheme results in a significantly lower attack effectiveness rate. Note that we are assuming it is easy for malicious nodes to know of messages (e.g., sent to mailing lists) that a large number of nodes will receive. Clearly, incorporating a white-list based scheme for processing messages from mailing lists at the level of a Cyberalter ego would be the best way of handling such attacks.

VI. MISCELLANEOUS

A. Protection of Privacy

Since our proposed anti-spam system is based on a social network, it is very important to protect users' privacy by preventing anybody from using the network to map out social links. Furthermore, if a malicious individual were able to map out the social email network, a database of social contacts could be constructed and used to send out more spams from spoofed personal contacts. To address this problem, all messages exchanged in the system must be forwarded anonymously. The basic idea is that when a node forwards a message, any information about which nodes the message has visited must be deleted before forwarding. This keeps all system communications at an acquaintance-to-acquaintance level; see Fig. 10.

FIG. 10: Protecting Privacy: A simple diagram illustrating the secure version of message forwarding in the system. For example, note that neither node C nor node D knows that the query comes from node A; similarly, node A does not know that the hit comes from node C.

B. System Resilience against User Unreliability

The users of the system are dynamic: users log on and log off as they wish. Since our system relies heavily on the underlying social email network, the natural question is: how many users of the system can be offline before the network is severely segmented into many small components?

Alternatively, we can rephrase the above problem as follows: how many nodes in a network can randomly fail before the network becomes fragmented? It turns out that this problem has been extensively studied analytically and numerically [15, 16]. Using site percolation theory, Cohen et al. [16] show that scale-free PL networks are extremely robust to random failures: for a PL network with PL exponent less than 3, the critical fraction of nodes, p_c, that needs to be removed for the network to fragment goes to 1 as the network size approaches infinity. Furthermore, for a finite-size network with a number of nodes on the order of tens of thousands, the critical fraction p_c is well over 0.99. Since the social email network is a PL network with exponent close to 2, these results from site percolation theory are directly applicable.

Therefore, the network will not be fragmented even if a massive number of system users suddenly leave. Alternatively, only a very small fraction of the users need to be using the system before they can start successfully exchanging information.
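The robustness argument of Section VI B can be made concrete with the standard random-failure estimate for uncorrelated random graphs, p_c = 1 - 1/(κ - 1) with κ = <k^2>/<k>. The sketch below applies it to two toy degree sequences; the function name and the sequences are ours, and the formula is an approximation that assumes a random graph with the given degree sequence, so it is not a substitute for the finite-size analysis in [16].

```python
from typing import Sequence


def critical_failure_fraction(degrees: Sequence[int]) -> float:
    """Estimate the fraction of randomly removed nodes needed to fragment
    an uncorrelated random graph with this degree sequence.

    Uses p_c = 1 - 1 / (kappa - 1) with kappa = <k^2> / <k>.
    Only meaningful when kappa > 2 (a giant component exists).
    """
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(d * d for d in degrees) / n
    kappa = mean_k2 / mean_k
    return 1.0 - 1.0 / (kappa - 1.0)


# Toy degree sequences (hypothetical): same size, very different tails.
homogeneous = [3] * 1000                  # every node has degree 3
heavy_tailed = [1] * 990 + [100] * 10     # a few hubs dominate <k^2>

print(f"homogeneous:  p_c = {critical_failure_fraction(homogeneous):.3f}")
print(f"heavy-tailed: p_c = {critical_failure_fraction(heavy_tailed):.3f}")
```

The heavy-tailed sequence tolerates the loss of a far larger fraction of nodes, because a large second moment <k^2> pushes the estimate toward 1.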



C. Simple Measures for Performance Improvement 

Spam Traps. When our proposed system is first deployed in the real world, the initial number of users will be small. In fact, all collaborative spam filters must overcome this "initial hurdle" in order to become widely used.

Our proposed solution to this "initial hurdle" problem is to install spam traps. By definition, a spam trap is an email account created for the sole purpose of attracting spams. These spam trap addresses can easily be promoted throughout the internet to attract a large number of spams. It has been noted by a commercial anti-spam company that only a few hundred well-spread spam traps are needed to catch almost all new spams [31]. These spam traps are not difficult to set up, and they do not cost much in bandwidth and memory storage to maintain. With spam traps properly installed, the system is ready to be deployed and to offer superior spam detection performance.

Hybrid and Multi Tier Design. As discussed in section V, legitimate emails from popular mailing lists can easily become blacklist targets of the malicious users of the system. In addition to the trust scheme we proposed in section II C, any traditional spam filtering technique can be used as the DefinitelySpam and DefinitelyNotSpam functions in Algorithm 1 to achieve enhanced performance and plug security holes in the collaborative system.

VII. CONCLUDING REMARKS

Our fairly comprehensive simulation results show that global social email networks possess several properties that can be exploited using recent advances in complex networks theory (e.g., the percolation search algorithm) to provide an efficient collaborative spam filter. Clearly, the proof-of-concept system discussed here can be vastly improved and augmented with schemes that have proven successful at various levels. Moreover, there is nothing special about searching for and caching spam digests, and one can use our pervasive message passing system for managing a general distributed information system. The primary requirement is to provide enough benefits to the users that they are motivated to cooperate, which is relatively easily accomplished when it comes to spam management. If users get used to the spam filtering system, then we envision that queries for other information will follow.

The study brings out several aspects of the burgeoning cyberspace networks and the increasingly powerful Cyberalter ego: (i) They have some of the same characteristics as their real-life counterparts, and hence can be managed and explored using well-studied schemes; (ii) In many P2P applications, we do not need to explicitly define new links and form the network from scratch; instead, existing cyberspace and social contacts can be exploited as an efficient P2P infrastructure. Such existing networks, combined with efficient tools borrowed from the field of complex networks, can achieve almost optimum performance. This work and similar recent concepts [8] constitute some of the first steps toward the management and design of efficient and naturally grown collaborative systems in cyberspace.